Back

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Institute of Electrical and Electronics Engineers (IEEE)

Preprints posted in the last 30 days, ranked by how well they match IEEE/ACM Transactions on Computational Biology and Bioinformatics's content profile, based on 32 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.1%
3.6%
Show abstract

The practices of identifying biomarkers and developing prognostic models using genomic data has become increasingly prevalent. Such data often features characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performances in these tasks on diverse right-censored time to event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several which have performed well in previous benchmarks, primarily for comparison in regards to variable selection capability, and secondarily for survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hoschberg and q-value procedures showed volatile performances in controlling the false discovery rate. Some methods performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.

2
TF-IDF k-mer-based Classical and Hybrid Machine Learning Models for SARS-CoV-2 Variant Classification under Imbalanced Genomic Data

Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.

2026-04-02 bioinformatics 10.64898/2026.04.02.716024 medRxiv
Top 0.1%
3.2%
Show abstract

Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel based support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The Random Forest classifier using TF-IDF Feature achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). Hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.

3
emb2dis: a novel protein disorder prediction tool based on ResNets, dilated convolutions & protein language models

Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.

2026-04-01 bioinformatics 10.64898/2026.03.30.715414 medRxiv
Top 0.5%
0.9%
Show abstract

Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction became an active area of research in the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives as input a protein sequence, calculates its pLM embedding and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to better capture an extended context of each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web-demo for emb2dis and a source code repository for local installation. Weblink for the toolhttps://sinc.unl.edu.ar/web-demo/emb2dis/ The importance of the emb2dis tool is that it provides a new deep learning approach and significant improvements in the prediction of protein disorder, with a simple web interface and graphical output detailing per-residue disorder.

4
IDBSpred: An intrinsically disordered binding site predictor using machine learning and protein language model

Jones, D.; Wu, Y.

2026-03-30 bioinformatics 10.64898/2026.03.27.714773 medRxiv
Top 0.8%
0.6%
Show abstract

Intrinsically disordered proteins (IDPs) mediate many cellular functions through interactions with structured protein partners, but predicting the corresponding binding sites on the structured partner remains challenging. Here, we present IDBSpred, a sequence-based method for residue-level prediction of IDP-binding sites on structured proteins. Training and test data were collected from the DIBS database, which contains more than 700 non-redundant IDP-protein complexes. Residue-level embeddings of structured partner sequences were generated using the ESM-2 protein language model and used as input to a multilayer perceptron classifier for binary prediction of binding versus non-binding residues. Analysis of amino acid composition showed that IDP-binding sites are enriched in aromatic residues, especially Trp, Tyr, and Phe, as well as several charged and polar residues, whereas Ala and several small or conformationally restrictive residues are depleted. The classifier achieved an ROC AUC of 0.87 and an average precision of 0.61. Structural case studies further showed that the predicted sites largely recapitulate the major experimentally defined binding interfaces. These results demonstrate that protein language model embeddings plus machine learning algorithms can effectively capture sequence features associated with IDP recognition on structured proteins. IDBSpred provides a practical framework for studying IDP-mediated interfaces and identifying potential therapeutic hotspots.

5
OpenMebius2: GUI-based software for 13C-metabolic flux analysis with tracer labeling pattern suggestions for accurate flux predictions

Imada, T.; Shimizu, H.; Toya, Y.

2026-03-24 bioengineering 10.64898/2026.03.20.698926 medRxiv
Top 0.8%
0.5%
Show abstract

13C-metabolic flux analysis (13C-MFA) is a crucial technique that experimentally determines metabolic flux distribution. Although precision of each flux strongly depends on tracer labeling pattern, its optimization remains challenging. We developed an integrated platform, OpenMebius2, a graphical user interface (GUI)-based software for 13C-MFA that includes a tracer labeling pattern suggestion function to support subsequent experiments. The proposed function leverages metabolic flux distributions and their 95 % confidence intervals obtained using low-cost 13C-labeled substrates to evaluate hypothetical parallel labeling scenarios and predict improvements in flux estimation precision. Availability and implementationThis software runs on Linux, macOS, and Windows. The source code and binary files are available at https://github.com/metabolic-engineering/OpenMebius2 under the PolyForm Noncommercial License 1.0.0.

6
SSPSPredictor: A Sequence and Structure based Deep Learning Model for Predicting Phase-Separating Proteins

Wang, T.; Liao, S.; Qi, Y.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715224 medRxiv
Top 0.9%
0.5%
Show abstract

Liquid-liquid phase separation (LLPS) underlies the formation of biomolecular liquid condensates (also referred to membraneless organelles, MLOs), which are essential for spatially organizing various biochemical processes within cells. Proteins that play a key role in driving condensates formation are termed phase-separating proteins (PSPs). Given experimental identification of PSPs remains labor-intensive and time-consuming, multiple computational tools have been developed based on empirical features or deep learning. In this study, we propose SSPSPredictor, a novel multimodal predictive model for PSPs with folded or intrinsically disordered structures, leveraging the fusion of sequence information from a protein language model ESM-2 and structural insights from a graph neural network GVP. Compared with existing tools, SSPSPredictor achieves balanced performance in identifying endogenous PSPs, predicting relative LLPS propensities, and recognizing key regions that drive LLPS. Moreover, SSPSPredictor exhibits good interpretability in identifying driving regions along protein sequences, although no relevant supervision was provided during training. Further predictive analysis of the human proteome using SSPSPredictor reveals that the proportion of intrinsically disordered proteins (IDPs) undergoing LLPS is significantly higher than that of folded proteins. In addition, pathogenic variants, especially those located in disordered regions, exhibit higher LLPS propensity than other mutations, uncovering a link between LLPS and diseases at the amino acid level.

7
OpenAlea.HydroRoot: A modelling framework to dissect, predict and phenotype branched root hydraulic architecture

Bauget, F.; Ndour, A.; Boursiac, Y.; Maurel, C.; Laplaze, L.; Lucas, M.; Pradal, C.

2026-03-23 plant biology 10.64898/2026.03.19.713025 medRxiv
Top 1.0%
0.4%
Show abstract

Drought is a significant factor in agricultural losses, making it imperative to understand how root system architecture (RSA) adapts to environmental condition like water deficit. HydroRoot is a functional-structural plant model (FSPM) aimed at analyzing and simulating hydraulic and solute transport of RSA. The model integrates a static hydraulic solver, a coupled water-solute transport solver, a statistical generator of RSA based on Markov model, and a dynamic hydraulic model accounting for root growth. This paper presents the model, the mathematical description of the formalism of solvers, and use cases with their associated tutorials. Five use cases illustrate capabilities of HydroRoot, which has been successfully used for phenotyping root hydraulics across various species, including Arabidopsis, maize, and millet. The model-driven phenotyping method "cut and flow" is presented to characterize axial and radial conductivities on a given root genotype. Finally, three step-by-step tutorials provide a structured way to learn how to use HydroRoot 1) to simulate hydraulic on a given architecture, 2) to simulate water and solute transport on a maize root, and 3) to simulate hydraulic on two pearl millet genotypes with varying soil conditions. Hydroroot is an open-source package of the OpenAlea platform, with the code publicly available on Github. A comprehensive documentation is available with a reproducible gallery of examples.

8
Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv
Top 1.0%
0.4%
Show abstract

ObjectiveSNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. MethodsWe benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. ResultsHeritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). ConclusionSNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input. Graphical Abstract O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=80 SRC="FIGDIR/small/716079v1_ufig1.gif" ALT="Figure 1"> View larger version (27K): org.highwire.dtl.DTLVardef@112929borg.highwire.dtl.DTLVardef@573c36org.highwire.dtl.DTLVardef@132170borg.highwire.dtl.DTLVardef@1871363_HPS_FORMAT_FIGEXP M_FIG C_FIG

9
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 1%
0.4%
Show abstract

We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. Elegans neural network.

10
A graph-based learning approach to predict the effects of gene perturbations on molecular phenotypes

Jin, Y.; Sverchkov, Y.; Sushkova, A.; Ohtake, M.; Emfinger, C.; Craven, M.

2026-03-23 systems biology 10.64898/2026.03.20.712202 medRxiv
Top 1%
0.4%
Show abstract

MotivationLarge-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest. ResultsWe evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often learn accurate models with small training sets, (4) benefit from having multiple sources of evidence in the input representation, (5) can, in many cases, transfer their predictive value to other phenotypes. Availability and ImplementationThe Assembled datasets and source code for this work is available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction

11
Multi-task deep learning integrating pretreatment MRI and whole slide images predicts induction chemotherapy response and survival in locally advanced nasopharyngeal carcinoma

Hou, J.; Yi, X.; Li, C.; Li, J.; Cao, H.; Lu, Q.; Yu, X.

2026-04-11 radiology and imaging 10.64898/2026.04.07.26350350 medRxiv
Top 1%
0.3%
Show abstract

Predicting response to induction chemotherapy (IC) and overall survival (OS) is critical for optimizing treatment in patients with locally advanced nasopharyngeal carcinoma (LANPC). This study aimed to develop and validate a multi-task deep learning model integrating pretreatment MRI and whole slide images (WSIs) to predict IC response and OS in LANPC. Pretreatment MRI and WSIs from 404 patients with LANPC were retrospectively collected to construct a multi-task model (MoEMIL) for the simultaneous prediction of early IC response and OS. MoEMIL employed multi-instance learning to process WSIs, PyRadiomics and a convolutional neural network (ResNet50) to extract MRI features, and fused multimodal features through a multi-gate mixture-of-experts architecture. Clustering-constrained attention multiple instance learning and gradient-weighted class activation mapping were applied for visualization and interpretation. MoEMIL effectively stratified patients into good and poor IC response groups, achieving areas under the curve of 0.917, 0.869, and 0.801 in the train, validation, and test sets, respectively, and outperformed the deep learning radiomics model, the pathomics model and TNM staging. The model also stratified patients into high- and low-risk OS groups (P < 0.05). MoEMIL shows promise as a decision-support tool for early IC response prediction and prognostication in LANPC. Author SummaryWe have developed a deep learning model that integrates two types of medical images, including magnetic resonance imaging (MRI) and digital pathological slices, to simultaneously predict response to induction chemotherapy and prognosis in patients with locally advanced nasopharyngeal carcinoma. Current treatment decisions primarily rely on traditional tumor staging (TNM), which often fails to comprehensively reflect the complexity of the disease. Our model, named MoEMIL, was trained and tested on data from 404 patients across two hospitals and consistently outperformed both single-model approaches and TNM staging methods. By identifying patients who exhibit poor response to induction chemotherapy or higher prognostic risk, our tool can assist clinicians in achieving personalized treatment, enabling intensified management for high-risk patients and avoiding unnecessary side effects for low-risk patients. Additionally, we visualize the models reasoning process through heat map generation, which highlights the image regions exerting the greatest influence on prediction outcomes. This work represents a step toward more precise treatment for nasopharyngeal carcinoma; however, larger-scale prospective studies are required before the model can be integrated into routine clinical practice.

12
Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv
Top 1%
0.3%
Show abstract

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important diagnostic tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows MLLMs to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's Kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-value (pbalanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.

13
A harmonized benchmarking framework for implementation-aware evaluation of 46 polygenic risk score tools across binary and continuous phenotypes

Muneeb, M.; Ascher, D.

2026-03-23 bioinformatics 10.64898/2026.03.22.713457 medRxiv
Top 1%
0.3%
Show abstract

Polygenic risk score (PRS) tools differ substantially in statistical assumptions, input requirements, and implementation complexity, making direct comparison difficult. We developed a harmonized, implementation-aware benchmarking framework to evaluate 46 PRS tools across seven binary UK Biobank phenotypes and one continuous trait under three model configurations: null, PRS-only, and PRS plus covariates. The framework integrates standardized preprocessing, tool-specific execution, hyperparameter exploration, and unified downstream evaluation using five-fold cross-validation on high-performance computing infrastructure. In addition to predictive performance, we assessed runtime, memory use, input dependencies, and failure modes. A Friedman test across 40 phenotype-fold combinations confirmed significant differences in tool rankings ({chi}2 = 102.29, p = 2.57 x 10-11), with no single method universally optimal. These findings provide a reproducible framework for comparative PRS evaluation and demonstrate that tool performance is shaped not only by statistical methodology but also by phenotype architecture, preprocessing choices, covariate structure, computational demands, software robustness, and practical implementation constraints.

14
Cellector: A tool to detect foreign genotype cells in scRNAseq data with applications in leukemia and microchimerism.

Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714571 medRxiv
Top 1%
0.3%
Show abstract

The existence of rare, genetically distinct cells can occur in various samples such as transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application that is of particular interest is that of leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which, the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patients genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patients leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign genotype cells in single cell RNAseq (scRNAseq) datasets. We show cellector accurately detects microchimeric cells down to an exceedingly low percentage of these cells present (0.05% or lower).

15
Enhancing non-local interaction modeling for ab initio biomolecular calculations and simulations with ViSNet-PIMA

Cui, T.; Wang, Z.; Wang, T.

2026-03-20 bioinformatics 10.64898/2026.03.18.712561 medRxiv
Top 1%
0.3%
Show abstract

AI-based molecular dynamics simulation brings ab initio calculations to biomolecules in an efficient way, in which the machine learning force field (MLFF) locates at the central position by accurately predicting the molecular energies and forces. Most existing MLFFs assume localized interatomic interactions, limiting their ability to accurately model non-local interactions, which are crucial in biomolecular dynamics. In this study, we introduce ViSNet-PIMA, which efficiently learns non-local interactions by physics-informed multipole aggregator (PIMA) and accurately encodes molecular geometric information. ViSNet-PIMA outperforms all state-of-the-art MLFFs for energy and force predictions of different kinds of biomolecules and various conformations on MD22 and AIMD-Chig datasets, while adapting the PIMA blocks into other MLFFs further achieves 55.1% performance gains, demonstrating the superiority of ViSNet-PIMA and the universality of the model design. Furthermore, we propose AI2BMD-PIMA to incorporate ViSNet-PIMA into AI2BMD simulation program by introducing "Transfer Learning-Pretraining-Finetuning" scheme and replacing molecular mechanics-based non-local calculations among protein fragments with ViSNet-PIMA, which reduces AI2BMDs energy and force calculation errors by more than 50% for different protein conformations and protein folding and unfolding processes. ViSNet-PIMA advances ab initio calculation for the entire biomolecules, amplifying the application values of AI-based molecular dynamics simulations and property calculations in biochemical research.

16
RD-Embed: Unified representations of rare-disease knowledge from clinical records

Groza, T.; Tan, F.; Lim, N. T. R.; Shanmugasundar, M. W.; Kappaganthu, J.; Lieviant, J. A.; Karnani, N.; Chen, H.; Wong, T. Y.; Jamuar, S. S.

2026-04-04 genetic and genomic medicine 10.64898/2026.04.02.26350083 medRxiv
Top 1%
0.2%
Show abstract

Rare diseases often present with incomplete, evolving symptoms and signs scattered across clinical notes and coded records, making diagnosis and gene discovery difficult even when genomic data are available. Existing approaches either depend on curated phenotype profiles or use general biomedical language models that are not aligned to rare-disease knowledge, limiting performance in early or ambiguous clinical presentations. Here, we show that RD-Embed - a three-stage representation framework that builds a base space that preserves domain knowledge, aligns clinical text and SNOMED-derived signals, and refines relationships with graph-based learning - enables robust rare-disease retrieval from heterogeneous clinical records. Across ten rare-disease datasets, RD-Embed attains up to >50% top-ten diagnostic retrieval using combined text and phenotype features, compared with ~30% on average for other embedding models and similarly sized large language models. On an EHR stress test, clinical alignment substantially improves text-based retrieval compared with ontology-only representations, supporting use in routine EHR data. We suggest RD-Embed is lightweight model that can be incorporated into existing hospital systems that supports rare disease identification and diagnosis, and gene prioritization.

17
SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration

Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.

2026-03-19 bioinformatics 10.64898/2026.03.17.712340 medRxiv
Top 2%
0.2%
Show abstract

MotivationMolecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. ResultsWe introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph- derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the models ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. AvailabilitySELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contacttuncadogan@gmail.com

18
A Novel ILP Framework to Identify Compensatory Pathways in Genetic Interaction Networks with GIDEON

Garcia, J. J.; Yu, K. M.; Freudenreich, C. H.; Cowen, L.

2026-03-31 bioinformatics 10.64898/2026.03.29.715009 medRxiv
Top 2%
0.2%
Show abstract

In Bakers yeast, there exists a comprehensive collection of pairwise epistasis experiments that, for nearly every pair of non-essential genes, measures the growth of the double-knockout strain as compared to its component single knockouts. This data can be represented as a weighted signed graph termed the genetic interaction network, and we introduce a new ILP-based method named GIDEON to search for a diverse collection of Between-Pathway Models (BPMs) in this network, where BPMs are a graph motif signature that indicates potential compensatory pathways in the genetic interaction network. With both an improved distribution-informed edge weighting scheme and an improved ILP method, GIDEON produces BPM collections that are substantially larger and with better functional enrichment compared to previous methods. We find some interesting new BPM gene sets including one with potential insights into antifungal drug targets through ties between ergosterol and aromatic amino acid biosynthesis.

19
eSIG-Net: Accurate prediction of single-mutation induced perturbations on protein interactions using a language model

Pan, X.; Shrawat, A.; Raghavan, S.; Dong, C.; Yang, Y.; Li, Z.; Zheng, W. J.; Eckhardt, S. G.; Wu, E.; Fuxman Bass, J. I.; Jarosz, D. F.; Chen, S.; McGrail, D. J.; Sheynkman, G. M.; Huang, J. H.; Sahni, N.; Yi, S. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714913 medRxiv
Top 2%
0.2%
Show abstract

Most proteins exert their functions in complex with other interactors. Single mutations can exhibit a profound impact on perturbing protein interactions, leading to human disease. However, predicting the effect of single mutations on protein interactions remains a major computational challenge. Deep learning, particularly protein language models or transformers, has become an effective tool in bioinformatics for protein structure prediction. However, the functional divergence of mutations makes it difficult to predict their interaction perturbation profiles. To address this fundamental challenge, we present eSIG-Net (edgetic mutation Sequence-based Interaction Grammar Network), a novel sequence-based "Interaction Language Model" for predicting protein interaction alterations caused by single mutations. eSIG-Net combines various protein sequence embeddings, introduces a mutation-encoding module with syntax and evolutionary insights, and employs contrastive learning to evaluate mutation-induced interaction changes. eSIG-Net significantly outperforms current state-of-the-art sequence-based and structure-based prediction methods at predicting mutational impact on protein interactions. We highlight examples where eSIG-Net nominates causal variants with high confidence and elucidates their functional role under relevant biological contexts. Together, eSIG-Net is a first-in-kind "interaction language model" that can accurately predict interaction-specific rewiring by single mutations with only sequence information, and exhibits generalizability across biological contexts.

20
Claw4Science: A Dataset and Platform for the OpenClaw Scientific Agent Ecosystem

Xu, M.; Chen, J.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715118 medRxiv
Top 2%
0.2%
Show abstract

Large language models have enabled a new class of scientific software in the form of AI agents that can execute research workflows across bioinformatics, drug discovery, and related domains. Among these systems, OpenClaw introduced a skill-based design that allows workflows to be expressed as structured Markdown files, lowering the barrier to contribution and enabling rapid ecosystem growth. However, this growth has led to fragmentation. Projects are distributed across independent repositories, skills vary widely in quality, naming is inconsistent, and there is no unified way to discover or compare available tools. In this work, we construct the first curated dataset of the OpenClaw scientific ecosystem. The dataset includes 91 projects organized by functional role and 2,230 skills spanning 34 scientific categories. Based on this dataset, we perform a systematic analysis of the structure, distribution, and emerging patterns of scientific agent development. To make this ecosystem accessible in practice, we further build Claw4Science, a public platform at https://claw4science.org, which is built on top of our dataset. The platform organizes projects and aggregates distributed skill repositories into a unified interface, with a focus on bioinformatics and scientific workflows, providing a practical entry point for navigating the ecosystem. Our results show that the OpenClaw ecosystem reflects a shift from isolated systems to a more modular and shareable model of scientific computation. At the same time, challenges in evaluation, reproducibility, and governance remain open. We argue that our dataset provides a foundation for future benchmark development and standardized infrastructure for scientific AI agents.